Overview

Claims management requires different levels of checks before a claim can be approved and a payment made. When unexpected events occur, customers expect their insurer to provide support as soon as possible. To meet these growing needs and expectations, BNP Paribas Cardif offered a challenge on Kaggle to accelerate its claims process.

In the challenge, BNP Paribas Cardif provided an anonymized dataset with two categories of claims:

  1. claims for which approval could be accelerated, leading to faster payments;

  2. claims for which additional information is required before approval.

The task is to predict the category of a claim based on features available early in the process.

The motivation for the project is twofold: first, to explore the dataset and gain some insights from it; and second, to apply machine learning models to predict the outcome.

The evaluation metric used to assess model accuracy is Log Loss defined as

\[\text{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N} \left( y_{i}\log p_{i} + (1-y_{i})\log(1-p_{i}) \right)\]

where N is the number of observations, log is the natural logarithm, \(y_{i}\) is the binary target, and \(p_{i}\) is the predicted probability that \(y_{i}\) equals 1.
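As a concrete check of the metric, here is a minimal sketch (in Python, separate from the R analysis that follows) showing how the penalty behaves at the extremes:

```python
import math

def log_loss(y, p, eps=1e-15):
    """Mean binary log loss; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1 - eps)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return -total / len(y)

# A confident correct prediction costs little; a confident wrong one is
# heavily penalised:
print(round(log_loss([1], [0.9]), 4))  # 0.1054
print(round(log_loss([1], [0.1]), 4))  # 2.3026
```

The clipping matters in practice: a predicted probability of exactly 0 or 1 on a misclassified claim would make the loss infinite.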

Data

BNP provided two datasets: a training dataset and a test dataset for submission. We only use the training dataset in this project, as the test dataset is reserved for competition scoring. The dataset contains both categorical and numeric variables available when the claims were received by BNP Paribas Cardif. All string-type variables are categorical; there are no ordinal variables. The “target” variable in the training dataset is the variable to predict and equals 1 for claims suitable for accelerated approval.

Read the dataset

library(vcd)
library(readr)
library(ggplot2)
library(gridExtra)
library(corrplot)
library(dplyr)
library(tidyr)
library(mice)
library(VIM)
library(caret)
theme_set(theme_bw(base_size = 20))

## Download the csv.gz file from https://github.com/ChoonC/CS107_Project
filename <- "train.csv.gz"
train <- read_csv(gzfile(filename)) 

#convert character variables to factor
train[sapply(train, is.character)] <- lapply(train[sapply(train, is.character)], as.factor)

Data Exploration

There are 114,321 observations and 133 variables in the dataset. Apart from the variables “ID” and “target”, all other variables/features (v1-v131) are anonymized, meaning we don’t know what the values represent. There are no duplicate IDs in the dataset; each observation is a unique claim.

Typically, personal insurance claims data would include information such as policy number, customer ID, contract date, claim date, event date, product, sub-product, age, processed date, etc. As the data is anonymized, it is unknown what has been provided or what process was used to anonymize it.

dim(train)
## [1] 114321    133
str(train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    114321 obs. of  133 variables:
##  $ ID    : int  3 4 5 6 8 9 12 21 22 23 ...
##  $ target: int  1 1 1 1 1 0 0 1 0 1 ...
##  $ v1    : num  1.336 NA 0.944 0.797 NA ...
##  $ v2    : num  8.73 NA 5.31 8.3 NA ...
##  $ v3    : Factor w/ 3 levels "A","B","C": 3 3 3 3 3 3 3 3 NA 3 ...
##  $ v4    : num  3.92 NA 4.41 4.23 NA ...
##  $ v5    : num  7.92 9.19 5.33 11.63 NA ...
##  $ v6    : num  2.6 NA 3.98 2.1 NA ...
##  $ v7    : num  3.18 NA 3.93 1.99 NA ...
##  $ v8    : num  0.0129 2.3016 0.0196 0.1719 NA ...
##  $ v9    : num  10 NA 12.67 8.97 NA ...
##  $ v10   : num  0.503 1.313 0.766 6.543 1.05 ...
##  $ v11   : num  16.4 NA 14.8 16.3 NA ...
##  $ v12   : num  6.09 6.51 6.38 9.65 6.32 ...
##  $ v13   : num  2.87 NA 2.51 3.9 NA ...
##  $ v14   : num  11.6 11.6 9.6 14.1 11 ...
##  $ v15   : num  1.36 NA 1.98 1.95 NA ...
##  $ v16   : num  8.57 NA 5.88 5.52 NA ...
##  $ v17   : num  3.67 NA 3.17 3.61 NA ...
##  $ v18   : num  0.107 NA 0.245 1.224 NA ...
##  $ v19   : num  0.149 NA 0.144 0.232 NA ...
##  $ v20   : num  18.9 NA 18 18.4 NA ...
##  $ v21   : num  7.73 6.76 5.25 7.52 6.41 ...
##  $ v22   : Factor w/ 18210 levels "AA","AAA","AAAA",..: 16671 7734 7087 1511 8038 4580 11283 3862 7522 12770 ...
##  $ v23   : num  -1.72e-08 NA -2.79e-07 -4.81e-07 NA ...
##  $ v24   : Factor w/ 5 levels "A","B","C","D",..: 3 3 5 4 5 1 5 4 5 3 ...
##  $ v25   : num  0.139 3.056 0.114 0.149 NA ...
##  $ v26   : num  1.72 NA 2.24 1.31 NA ...
##  $ v27   : num  3.39 NA 5.31 2.3 NA ...
##  $ v28   : num  0.59 NA 0.836 8.927 NA ...
##  $ v29   : num  8.88 NA 7.5 8.87 NA ...
##  $ v30   : Factor w/ 7 levels "A","B","C","D",..: 3 3 NA 3 NA NA 7 3 NA 3 ...
##  $ v31   : Factor w/ 3 levels "A","B","C": 1 1 1 2 1 1 1 1 NA 1 ...
##  $ v32   : num  1.08 NA 1.45 1.59 NA ...
##  $ v33   : num  1.01 NA 1.73 1.67 NA ...
##  $ v34   : num  7.27 3.62 4.04 8.7 6.08 ...
##  $ v35   : num  8.38 NA 7.96 8.9 NA ...
##  $ v36   : num  11.3 14.6 12.7 11.3 NA ...
##  $ v37   : num  0.455 NA 0.26 0.434 NA ...
##  $ v38   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ v39   : num  4.012 NA 7.379 0.287 NA ...
##  $ v40   : num  7.71 14.31 13.08 11.52 10.14 ...
##  $ v41   : num  7.65 NA 6.17 7.93 NA ...
##  $ v42   : num  12.7 NA 12.3 12.9 NA ...
##  $ v43   : num  2.02 NA 2.93 1.47 NA ...
##  $ v44   : num  10.5 NA 8.9 12.7 NA ...
##  $ v45   : num  9.85 NA 5.34 9.67 NA ...
##  $ v46   : num  0.114 2.45 0.126 0.108 NA ...
##  $ v47   : Factor w/ 10 levels "A","B","C","D",..: 3 5 3 3 9 9 3 9 4 9 ...
##  $ v48   : num  12.2 NA 12.7 12.2 NA ...
##  $ v49   : num  8.09 NA 6.84 8.59 NA ...
##  $ v50   : num  0.899 1.379 0.605 3.329 1.365 ...
##  $ v51   : num  7.28 NA 9.64 4.78 NA ...
##  $ v52   : Factor w/ 12 levels "A","B","C","D",..: 7 7 6 8 8 11 1 3 8 1 ...
##  $ v53   : num  16.7 NA 15.1 16.6 NA ...
##  $ v54   : num  0.0371 1.1295 0.0856 0.1397 NA ...
##  $ v55   : num  1.3 NA 0.765 1.178 NA ...
##  $ v56   : Factor w/ 122 levels "A","AA","AB",..: 86 102 18 48 NA 101 18 93 6 NA ...
##  $ v57   : num  3.97 NA 4.03 3.97 NA ...
##  $ v58   : num  0.53 NA 4.28 1.73 NA ...
##  $ v59   : num  10.89 NA 9.11 11.78 NA ...
##  $ v60   : num  1.59 NA 2.15 1.23 NA ...
##  $ v61   : num  15.9 NA 16.1 15.9 NA ...
##  $ v62   : int  1 2 1 1 1 1 2 2 0 3 ...
##  $ v63   : num  0.153 2.545 0.124 0.14 NA ...
##  $ v64   : num  6.36 NA 5.52 6.29 NA ...
##  $ v65   : num  18.3 NA 16.4 17 NA ...
##  $ v66   : Factor w/ 3 levels "A","B","C": 3 1 1 1 3 1 1 3 2 1 ...
##  $ v67   : num  9.31 NA 8.37 9.7 NA ...
##  $ v68   : num  15.2 NA 11 18.6 NA ...
##  $ v69   : num  17.14 NA 5.88 9.43 NA ...
##  $ v70   : num  11.78 12.05 8.46 13.59 NA ...
##  $ v71   : Factor w/ 9 levels "A","B","C","D",..: 5 5 2 5 5 5 5 5 2 5 ...
##  $ v72   : int  1 2 3 2 1 1 2 2 0 3 ...
##  $ v73   : num  1.61 NA 2.41 2.27 NA ...
##  $ v74   : Factor w/ 3 levels "A","B","C": 2 2 2 2 2 2 2 2 2 2 ...
##  $ v75   : Factor w/ 4 levels "A","B","C","D": 4 4 2 4 4 4 4 4 2 4 ...
##  $ v76   : num  2.23 NA 1.96 2.19 NA ...
##  $ v77   : num  7.29 NA 5.92 8.21 NA ...
##  $ v78   : num  8.57 NA 11.76 13.45 NA ...
##  $ v79   : Factor w/ 18 levels "A","B","C","D",..: 5 4 5 2 3 9 5 3 16 3 ...
##  $ v80   : num  3 NA 3.33 1.95 NA ...
##  $ v81   : num  7.53 7.28 10.19 4.8 NA ...
##  $ v82   : num  8.86 3.43 8.27 13.32 NA ...
##  $ v83   : num  0.65 NA 1.53 1.68 NA ...
##  $ v84   : num  1.3 NA 1.53 1.38 NA ...
##  $ v85   : num  1.71 NA 2.43 1.59 NA ...
##  $ v86   : num  0.866 NA 1.071 1.243 NA ...
##  $ v87   : num  9.55 9.85 8.45 10.75 NA ...
##  $ v88   : num  3.32 NA 3.37 1.41 NA ...
##  $ v89   : num  0.0957 2.6786 0.1114 0.0391 NA ...
##  $ v90   : num  0.905 NA 0.811 1.042 NA ...
##  $ v91   : Factor w/ 7 levels "A","B","C","D",..: 1 2 7 2 7 7 2 1 3 1 ...
##  $ v92   : num  0.442 NA 0.271 0.764 NA ...
##  $ v93   : num  5.81 NA 5.16 5.5 NA ...
##  $ v94   : num  3.52 NA 4.21 3.42 NA ...
##  $ v95   : num  0.462 NA 0.31 0.833 NA ...
##  $ v96   : num  7.44 NA 5.66 7.38 NA ...
##  $ v97   : num  5.45 NA 5.97 6.75 NA ...
##   [list output truncated]
#check that each ID is unique
n_obs <- n_distinct(train$ID)
n_obs
## [1] 114321
table(train$target)
## 
##     0     1 
## 27300 87021
prop.table(table(train$target))
## 
##         0         1 
## 0.2388013 0.7611987
#count number of numeric features
sum(sapply(train, is.numeric))
## [1] 114
#count number of categorical features
sum(sapply(train, is.factor))
## [1] 19
#Create tables by data types for exploration, keep ID and target on both for sorting
train_num <- train[,sapply(train, is.numeric)] %>% 
              arrange(-target, ID) 
train_fac <- bind_cols(select(train, c(ID, target)), 
                       train[,sapply(train, is.factor)]) %>% 
            arrange(-target, ID)

87,021 or 76.1% of the claims were classified as suitable for accelerated approval. There are 112 numeric features in addition to the ID and target variables, and 19 categorical features.
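Because the classes are imbalanced, it is worth noting the log loss of a naive constant model as a benchmark before fitting anything. This back-of-the-envelope check (in Python, not part of the original R analysis) predicts the observed prior for every claim:

```python
import math

# Constant-prediction baseline: always predict the observed prior
# p = 87021/114321, the proportion of target = 1 claims.
p = 87021 / 114321
baseline = -(p * math.log(p) + (1 - p) * math.log(1 - p))
print(round(baseline, 4))  # 0.5497
```

Any model we fit should achieve a log loss clearly below this baseline of about 0.55 to be adding value.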

Numeric features

Four of the numeric features contain discrete values (v38, v62, v72, v129), which might suggest some kind of counts. All other numeric variables are continuous.

From the summary, we can see that a large number of the numeric features have missing values, and most have values in the range 0-20. (v14 has 40 negative values on the order of \(-10^{-7}\), which we regard as 0.)

#summarise the features
sapply(select(train_num, -ID),summary)
## $target
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.7612  1.0000  1.0000 
## 
## $v1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.91    1.47    1.63    2.14   20.00   49832 
## 
## $v2
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.32    7.02    7.46    9.46   20.00   49796 
## 
## $v4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.49    4.21    4.14    4.83   20.00   49796 
## 
## $v5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.61    8.67    8.74    9.77   20.00   48624 
## 
## $v6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.06    2.41    2.44    2.77   20.00   49832 
## 
## $v7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.10    2.45    2.48    2.83   20.00   49832 
## 
## $v8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.09    0.39    1.50    1.62   20.00   48619 
## 
## $v9
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.85    9.06    9.03   10.23   20.00   49851 
## 
## $v10
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   1.050   1.313   1.883   2.101  18.530      84 
## 
## $v11
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   15.50   15.45   15.95   20.00   49836 
## 
## $v12
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.322   6.613   6.881   7.020  18.710      86 
## 
## $v13
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.07    3.59    3.80    4.29   20.00   49832 
## 
## $v14
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## -0.000001 11.260000 11.970000 12.090000 12.720000 20.000000         4 
## 
## $v15
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.61    1.99    2.08    2.42   20.00   49836 
## 
## $v16
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.87    4.93    4.92    5.96   20.00   49895 
## 
## $v17
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.70    3.55    3.83    4.51   20.00   49796 
## 
## $v18
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.51    0.77    0.84    1.07   20.00   49832 
## 
## $v19
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.17    0.20    0.22    0.24   20.00   49843 
## 
## $v20
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.52   17.33   18.04   17.77   18.54   20.00   49840 
## 
## $v21
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1062  6.4160  7.0450  7.0300  7.6710 19.3000     611 
## 
## $v23
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    1.09    0.00   20.00   50675 
## 
## $v25
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.04    0.15    0.47    1.70    1.95   20.00   48619 
## 
## $v26
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.51    1.83    1.88    2.18   20.00   49832 
## 
## $v27
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.20    2.67    2.74    3.22   20.00   49832 
## 
## $v28
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.49    5.04    5.09    6.58   19.85   49832 
## 
## $v29
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.43    8.30    8.21    9.09   20.00   49832 
## 
## $v32
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.26    1.56    1.62    1.90   17.56   49832 
## 
## $v33
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.47    1.95    2.16    2.63   20.00   49832 
## 
## $v34
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   5.054   6.537   6.406   7.703  20.000     111 
## 
## $v35
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.25    8.07    8.12    8.94   20.00   49832 
## 
## $v36
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   11.77   13.77   13.38   15.32   20.00   48624 
## 
## $v37
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.40    0.64    0.74    0.95   20.00   49843 
## 
## $v38
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.00000  0.00000  0.09093  0.00000 12.00000 
## 
## $v39
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.13    0.38    1.24    1.19   19.92   49836 
## 
## $v40
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   8.408  10.330  10.470  12.770  20.000     111 
## 
## $v41
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.54    7.20    7.18    7.83   20.00   49832 
## 
## $v42
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.34   12.93   12.92   13.49   20.00   49832 
## 
## $v43
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.79    2.15    2.22    2.56   20.00   49836 
## 
## $v44
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.57   10.78   10.80   12.02   19.83   49796 
## 
## $v45
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.83    9.16    9.14   10.42   20.00   49832 
## 
## $v46
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.07    0.12    0.44    1.63    1.82   20.00   48619 
## 
## $v48
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   11.22   12.41   12.54   13.78   20.00   49796 
## 
## $v49
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    7.47    8.02    8.02    8.56   20.00   49832 
## 
## $v50
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.0000  0.6588  1.2120  1.5040  2.0070 20.0000      86 
## 
## $v51
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.60    7.13    7.20    8.64   20.00   50678 
## 
## $v53
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.28   15.77   15.71   16.22   20.00   49836 
## 
## $v54
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.01    0.09    0.31    1.25    1.41   20.00   48619 
## 
## $v55
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.98    1.37    1.56    1.94   20.00   49832 
## 
## $v57
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.65    4.07    4.08    4.49   20.00   49832 
## 
## $v58
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.50    5.33    7.70   13.96   20.00   49836 
## 
## $v59
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.06   10.54   10.59   12.03   20.00   49796 
## 
## $v60
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.36    1.67    1.71    2.01   20.00   49832 
## 
## $v61
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   13.60   15.08   14.58   16.11   18.85   49796 
## 
## $v62
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.031   1.000   7.000 
## 
## $v63
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.05    0.14    0.46    1.69    1.85   20.00   48619 
## 
## $v64
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    4.79    6.11    6.34    7.52   20.00   49796 
## 
## $v65
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.66   15.03   16.26   15.85   17.16   20.00   49840 
## 
## $v67
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.58    9.31    9.29    9.99   20.00   49832 
## 
## $v68
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.50   17.08   18.27   17.56   18.91   20.00   49836 
## 
## $v69
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    8.39    9.52    9.45   10.54   20.00   49895 
## 
## $v70
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.43   10.80   12.49   12.27   13.99   19.82   48636 
## 
## $v72
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.432   2.000  12.000 
## 
## $v73
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.90    2.33    2.43    2.85   20.00   49836 
## 
## $v76
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.63    2.17    2.40    2.81   20.00   49796 
## 
## $v77
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.50    7.38    7.31    8.16   15.97   49832 
## 
## $v78
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.31   13.33   13.33   14.39   20.00   49895 
## 
## $v80
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.45    2.09    2.21    2.86   20.00   49851 
## 
## $v81
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.98    7.52    7.29    8.78   20.00   48624 
## 
## $v82
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.38    3.69    6.21    8.79   20.00   48624 
## 
## $v83
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.43    1.94    2.17    2.67   20.00   49832 
## 
## $v84
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.96    1.42    1.61    2.07   20.00   49832 
## 
## $v85
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.84    2.60    2.82    3.57   20.00   50682 
## 
## $v86
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.94    1.16    1.22    1.42   17.56   49832 
## 
## $v87
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.87    8.05   10.00   10.18   12.23   19.84   48663 
## 
## $v88
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.18    1.76    1.92    2.46   20.00   49832 
## 
## $v89
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.02    0.10    0.33    1.52    1.75   20.00   48619 
## 
## $v90
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.86    0.97    0.97    1.06    6.31   49836 
## 
## $v92
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.44    0.54    0.58    0.68    8.92   49843 
## 
## $v93
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    4.55    5.30    5.47    6.22   20.00   49832 
## 
## $v94
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    3.33    3.74    3.85    4.22   19.02   49832 
## 
## $v95
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.50    0.62    0.67    0.77    9.07   49843 
## 
## $v96
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.76    6.51    6.46    7.23   20.00   49832 
## 
## $v97
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.32    7.45    7.62    8.78   20.00   49843 
## 
## $v98
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.12    7.64    7.67    9.06   19.06   48654 
## 
## $v99
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.93    1.24    1.25    1.55   20.00   49832 
## 
## $v100
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.75   14.48   12.09   18.32   20.00   49836 
## 
## $v101
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.26    6.62    6.87    8.24   20.00   49796 
## 
## $v102
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.79    2.46    2.89    3.41   20.00   51316 
## 
## $v103
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    4.50    5.13    5.30    5.88   18.78   49832 
## 
## $v104
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.14    2.51    2.64    2.95   20.00   49832 
## 
## $v105
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.06    0.24    1.08    1.02   20.00   48658 
## 
## $v106
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   10.05   12.09   11.79   13.77   20.00   49796 
## 
## $v108
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.53    1.98    2.15    2.54   20.00   48624 
## 
## $v109
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.84    3.09    4.18    5.15   20.00   48624 
## 
## $v111
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.32    3.11    3.37    4.12   20.00   49832 
## 
## $v114
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   12.00   14.04   13.57   15.37   20.00      30 
## 
## $v115
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    9.46   10.48   10.55   11.61   20.00   49895 
## 
## $v116
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.85    2.22    2.29    2.65   20.00   49836 
## 
## $v117
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.76    8.07    8.30   10.50   20.00   48624 
## 
## $v118
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    6.98    8.14    8.37    9.56   20.00   49843 
## 
## $v119
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.32    1.46    3.17    4.17   20.00   50680 
## 
## $v120
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.78    1.15    1.29    1.65   10.39   49836 
## 
## $v121
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.79    2.44    2.74    3.38   20.00   49840 
## 
## $v122
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    5.65    6.75    6.82    7.91   20.00   49851 
## 
## $v123
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.02    1.96    2.74    3.55    4.08   19.69   50678 
## 
## $v124
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.02    0.14    0.92    0.87   20.00   48619 
## 
## $v126
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.42    1.62    1.67    1.84   15.63   49832 
## 
## $v127
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    2.10    2.96    3.24    4.11   20.00   49832 
## 
## $v128
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.39    1.80    2.03    2.39   20.00   48624 
## 
## $v129
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3101  0.0000 11.0000 
## 
## $v130
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.11    1.56    1.93    2.33   20.00   49843 
## 
## $v131
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    1.01    1.59    1.74    2.26   20.00   49895
# Look at the negative values in v14
select(train_num,v14) %>% 
  filter(v14<0) %>% 
  arrange(-v14)
## Source: local data frame [40 x 1]
## 
##              v14
##            (dbl)
## 1  -2.234954e-08
## 2  -4.923125e-08
## 3  -6.485695e-08
## 4  -7.864030e-08
## 5  -8.883815e-08
## 6  -1.024361e-07
## 7  -1.381705e-07
## 8  -1.543122e-07
## 9  -1.642419e-07
## 10 -1.713699e-07
## ..           ...

Only 4 of the numeric features have no missing values (v38, v62, v72, v129). The majority (100) of the features have a proportion of missing values between 42.5% and 44.5%. The remaining features have at most 0.53% missing values. The proportion of missing values is slightly greater in claims which were accelerated (target = 1) than in those requiring further information (target = 0).

# Function to return a table of missing counts and percentages for all features
getMissing <- function(tbl){
    data.frame(feature.names = colnames(tbl), 
               n_Miss   = colSums(is.na(tbl)),
               pct_Miss = colMeans(is.na(tbl))) %>% 
    arrange(-n_Miss) 
}

#Overall missingness 
missing_num <- getMissing(select(train_num, -ID))
table(missing_num$n_Miss)
## 
##     0     4    30    84    86   111   611 48619 48624 48636 48654 48658 
##     5     1     1     1     2     2     1     7     8     1     1     1 
## 48663 49796 49832 49836 49840 49843 49851 49895 50675 50678 50680 50682 
##     1    11    34    12     3     7     3     5     1     2     1     1 
## 51316 
##     1
#Missingness for target = 1
missing_tgt1 <- getMissing(select(filter(train_num, target == 1), -ID)) 
names(missing_tgt1) <- c("feature.names", "n_Miss_tgt1", "pct_Miss_tgt1")

#Join the tables 
missing_num <- missing_num %>% 
  inner_join(missing_tgt1) 
## Joining by: "feature.names"
slice(missing_num, 1:10)
##    feature.names n_Miss  pct_Miss n_Miss_tgt1 pct_Miss_tgt1
## 1           v102  51316 0.4488764       39472     0.4535917
## 2            v85  50682 0.4433306       39037     0.4485929
## 3           v119  50680 0.4433131       39035     0.4485699
## 4            v51  50678 0.4432956       39033     0.4485469
## 5           v123  50678 0.4432956       39033     0.4485469
## 6            v23  50675 0.4432694       39031     0.4485239
## 7            v16  49895 0.4364465       38424     0.4415486
## 8            v69  49895 0.4364465       38424     0.4415486
## 9            v78  49895 0.4364465       38424     0.4415486
## 10          v115  49895 0.4364465       38424     0.4415486
rm(missing_tgt1)

For the features with a high proportion of missing values, the count of observations with missing values takes on only a small number of distinct values; the same holds within each target group. This suggests that those groups of features tend to have missing values together and might be related to the type of claim (product type).
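The idea of "features that go missing together" can be made explicit by grouping columns on their exact missingness mask. A minimal sketch (plain Python on a hypothetical toy table, not the actual BNP data):

```python
from collections import defaultdict

# Toy data: columns b and c are missing together (None marks NA).
rows = [
    {"a": 1.0,  "b": None, "c": None},
    {"a": 2.0,  "b": 0.5,  "c": 3.0},
    {"a": None, "b": 0.7,  "c": 1.0},
    {"a": 4.0,  "b": None, "c": None},
]

# Group features by their missingness pattern across observations.
groups = defaultdict(list)
for col in ["a", "b", "c"]:
    mask = tuple(row[col] is None for row in rows)
    groups[mask].append(col)

for mask, cols in groups.items():
    print(cols, "missing in rows", [i for i, m in enumerate(mask) if m])
```

Columns sharing a mask form one block; on the real data, blocks like this would correspond to the groups of features with identical missing counts in the table above.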

We use the VIM package to get a better understanding of the missing data (ref: [Imputing missing data with R; MICE package, R-bloggers](http://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/)).

62,561 (55%) of the observations have complete numeric values. 47,715 (42%) have missing values for the same group of features, which confirms our suspicion that the same groups of features tend to be missing together. There appear to be some patterns and clusters of observations, and the pattern of missing values does not differ much between the target groups. The missing values are unlikely to be random; they are more likely related to the types of claims or products and the questions asked about the claims.

Due to the large proportion of features with missing values, excluding those observations or features would lead to a loss of information. We impute the missing values for the features where the proportion of missing values is less than 5%, namely v10, v12, v14, v21, v34, v40, v50 and v114. We then add an indicator to the dataset to flag observations with missing values versus those without.
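The impute-then-flag step can be sketched as follows (plain Python on a toy column; the actual report relies on the mice/caret machinery in R, so this only illustrates the idea of median imputation plus a missingness indicator):

```python
# Toy column with missing values (None marks NA).
v10 = [0.503, 1.313, None, 6.543, 1.05, None]

# Median of the observed values.
observed = sorted(x for x in v10 if x is not None)
mid = len(observed) // 2
median = (observed[mid] if len(observed) % 2
          else (observed[mid - 1] + observed[mid]) / 2)

# Impute, and keep an indicator so models can still see "was missing".
was_missing = [int(x is None) for x in v10]
v10_imputed = [median if x is None else x for x in v10]

print(median)        # 1.1815
print(was_missing)   # [0, 0, 1, 0, 0, 1]
```

Keeping the indicator matters here precisely because the missingness looks informative (related to claim type) rather than random.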

#Visualise missing data
tmp <- select(train_num, -c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      v102 51316
##       v85 50682
##      v119 50680
##       v51 50678
##      v123 50678
##       v23 50675
##       v16 49895
##       v69 49895
##       v78 49895
##      v115 49895
##      v131 49895
##        v9 49851
##       v80 49851
##      v122 49851
##       v19 49843
##       v37 49843
##       v92 49843
##       v95 49843
##       v97 49843
##      v118 49843
##      v130 49843
##       v20 49840
##       v65 49840
##      v121 49840
##       v11 49836
##       v15 49836
##       v39 49836
##       v43 49836
##       v53 49836
##       v58 49836
##       v68 49836
##       v73 49836
##       v90 49836
##      v100 49836
##      v116 49836
##      v120 49836
##        v1 49832
##        v6 49832
##        v7 49832
##       v13 49832
##       v18 49832
##       v26 49832
##       v27 49832
##       v28 49832
##       v29 49832
##       v32 49832
##       v33 49832
##       v35 49832
##       v41 49832
##       v42 49832
##       v45 49832
##       v49 49832
##       v55 49832
##       v57 49832
##       v60 49832
##       v67 49832
##       v77 49832
##       v83 49832
##       v84 49832
##       v86 49832
##       v88 49832
##       v93 49832
##       v94 49832
##       v96 49832
##       v99 49832
##      v103 49832
##      v104 49832
##      v111 49832
##      v126 49832
##      v127 49832
##        v2 49796
##        v4 49796
##       v17 49796
##       v44 49796
##       v48 49796
##       v59 49796
##       v61 49796
##       v64 49796
##       v76 49796
##      v101 49796
##      v106 49796
##       v87 48663
##      v105 48658
##       v98 48654
##       v70 48636
##        v5 48624
##       v36 48624
##       v81 48624
##       v82 48624
##      v108 48624
##      v109 48624
##      v117 48624
##      v128 48624
##        v8 48619
##       v25 48619
##       v46 48619
##       v54 48619
##       v63 48619
##       v89 48619
##      v124 48619
##       v21   611
##       v34   111
##       v40   111
##       v12    86
##       v50    86
##       v10    84
##      v114    30
##       v14     4
##       v38     0
##       v62     0
##       v72     0
##      v129     0
#Target = 1
tmp <- train_num %>% filter(target==1) %>% select(-c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data target 1"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##      v102 39472
##       v85 39037
##      v119 39035
##       v51 39033
##      v123 39033
##       v23 39031
##       v16 38424
##       v69 38424
##       v78 38424
##      v115 38424
##      v131 38424
##        v9 38391
##       v80 38391
##      v122 38391
##       v19 38384
##       v37 38384
##       v92 38384
##       v95 38384
##       v97 38384
##      v118 38384
##      v130 38384
##       v20 38383
##       v65 38383
##      v121 38383
##       v11 38379
##       v15 38379
##       v39 38379
##       v43 38379
##       v53 38379
##       v58 38379
##       v68 38379
##       v73 38379
##       v90 38379
##      v100 38379
##      v116 38379
##      v120 38379
##        v1 38376
##        v6 38376
##        v7 38376
##       v13 38376
##       v18 38376
##       v26 38376
##       v27 38376
##       v28 38376
##       v29 38376
##       v32 38376
##       v33 38376
##       v35 38376
##       v41 38376
##       v42 38376
##       v45 38376
##       v49 38376
##       v55 38376
##       v57 38376
##       v60 38376
##       v67 38376
##       v77 38376
##       v83 38376
##       v84 38376
##       v86 38376
##       v88 38376
##       v93 38376
##       v94 38376
##       v96 38376
##       v99 38376
##      v103 38376
##      v104 38376
##      v111 38376
##      v126 38376
##      v127 38376
##        v2 38350
##        v4 38350
##       v17 38350
##       v44 38350
##       v48 38350
##       v59 38350
##       v61 38350
##       v64 38350
##       v76 38350
##      v101 38350
##      v106 38350
##       v87 37545
##      v105 37540
##       v98 37538
##       v70 37525
##        v5 37517
##       v36 37517
##       v81 37517
##       v82 37517
##      v108 37517
##      v109 37517
##      v117 37517
##      v128 37517
##        v8 37512
##       v25 37512
##       v46 37512
##       v54 37512
##       v63 37512
##       v89 37512
##      v124 37512
##       v21   535
##       v34   102
##       v40   102
##       v12    85
##       v50    85
##       v10    83
##      v114    22
##       v14     4
##       v38     0
##       v62     0
##       v72     0
##      v129     0
rm(tmp)

Distribution of the data

The distributions of the numeric features are mostly either approximately normal or skewed. A few features have concave distributions, and some appear to represent intervals (possibly time). The density plots suggest several features that might be predictive: v10, v14, v2, v34, v36, v40, v50, v58, v100, v106, v114, v98, v119, v123, v129.

#Plot the scaled density of each numeric feature by target
plotDensity <- function(df){
  d <- gather(df, key = variable, value = value, -target, na.rm = TRUE) %>%
       mutate(target = as.character(target))
  ggplot(d, aes(x = value, y = ..scaled.., fill = target)) + 
    geom_density(alpha = 0.4) + 
    facet_wrap(~variable, scales = "free") 
}

select(train_num, 2,3:22) %>% plotDensity()

select(train_num, 2,23:42) %>% plotDensity()

select(train_num, 2,43:62) %>% plotDensity()

select(train_num, 2,63:82) %>% plotDensity()

select(train_num, 2,83:102) %>% plotDensity()

select(train_num, 2,103:114) %>% plotDensity()

Categorical Features

One categorical feature (v22) has 18,210 levels, and two others (v125 and v56) have 90 and 122 levels. There are 8 categorical features with no missing values (v24, v47, v66, v71, v74, v75, v79, v110). Two features (v30, v113) have almost 50% missing values, and the remaining categorical features have less than 6% missing values.

From the visualisation, the proportion of missing values in v113 is greater in the target = 1 group. v3 and v31 are missing on exactly the same observations; they have the same number of categories but take different values.

We looked at the combinations of missing values between v30, v113, and the numeric variables with a high proportion of missing values. In 19% of observations v30 was missing together with the numeric variables; 17% had missing values in v113 only; 12% were missing in both v30 and v113 but not the numeric variables; 12% were missing in both the categorical and the numeric features; 8.3% in v30 only; and 6.5% in v113 together with the numeric features. Missingness in v30 or v113 therefore does not imply that the numeric features are also missing.
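The breakdown above can be reproduced by tabulating the joint missingness patterns. A minimal sketch on a synthetic data frame (the column names v30, v113 and v50 stand in for the real features; the data are illustrative, not from the training set):

```r
# Tabulate joint missingness patterns between two categorical features
# and one representative numeric feature; 'df' is synthetic toy data.
df <- data.frame(
  v30  = c(NA,  "C", NA,  "C", NA,  "C"),
  v113 = c(NA,  NA,  "G", "G", NA,  "G"),
  v50  = c(NA,  1.2, NA,  0.7, 3.1, NA)
)
pattern <- paste(
  ifelse(is.na(df$v30),  "v30",  "."),
  ifelse(is.na(df$v113), "v113", "."),
  ifelse(is.na(df$v50),  "num",  "."),
  sep = "/"
)
round(prop.table(table(pattern)), 2)
```

On the real data, the same `prop.table(table(...))` call over v30, v113 and one of the high-missingness numeric columns yields the percentages quoted above.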

sapply(train_fac,summary)
## $ID
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3   57280  114200  114200  171200  228700 
## 
## $target
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.7612  1.0000  1.0000 
## 
## $v3
##      A      B      C   NA's 
##    227     53 110584   3457 
## 
## $v22
##    AGDF     YGJ     QKI     PWR     HZE     MNZ     PTO     ROZ     YOD 
##    2386    2119     668     649     423     382     350     344     303 
##     GBS     VZF     AJQ     WNI     HDD     NRT     PFR     YEP    ADDF 
##     299     296     252     244     243     238     226     224     212 
##     VVI     QVR     WRI    ABOF     PSE     NGS    ACWE     TVR     PTJ 
##     211     210     207     204     199     197     190     182     178 
##    ADMP     UAG    AGON     AHE     NWG     ABF      TG     EJC    AAPP 
##     172     169     167     167     167     166     166     165     164 
##     OFD     AMR    AFYU    ADMI    AFOZ     DJU     PBC     GEJ     AXH 
##     161     160     159     156     152     152     152     150     148 
##     QKP     HUU     MQE     RIC     JGY    ACHJ     KLZ     GEB     WFT 
##     147     146     146     146     138     137     136     132     132 
##    ACXD    ADGN     AWT     NXE    AGZT     BLE     LIP     LJS     TTX 
##     131     129     128     128     127     126     126     124     124 
##     ZLH     CMG     UBK    AAGW    AGMO     QOE     AJO     EIA     ONA 
##     124     123     123     122     122     121     118     117     116 
##     YBV     OPF     GXR     KOX     MLO     AFD     DLW     IAD     EJZ 
##     114     113     112     111     111     109     105     105     104 
##      OP     BQS     RXB     SPT     WPS    ADLK     SBE     UUC    AEXD 
##     104     103     103     103     103     102     102     102     101 
##    AFUF     ESJ     GHP    AHBA     FVQ     NTM     HOF     SEZ (Other) 
##     101     101     101     100     100     100      99      99   93095 
##    NA's 
##     500 
## 
## $v24
##     A     B     C     D     E 
##  3789  8150 20872 26333 55177 
## 
## $v30
##     A     B     C     D     E     F     G  NA's 
##  2313   205 32178  5225  2973  2589  8728 60110 
## 
## $v31
##     A     B     C  NA's 
## 88347 18947  3570  3457 
## 
## $v47
##     A     B     C     D     E     F     G     H     I     J 
##    38    50 55425  3157  5301  4322  3946     1 39071  3010 
## 
## $v52
##     A     B     C     D     E     F     G     H     I     J     K     L 
##  8925  9385  9681  9607  9282  9806  9419  8323 10260 11103  8949  9578 
##  NA's 
##     3 
## 
## $v56
##      BW      DI      AS      BZ      AW      DO       P      DP      CY 
##   11351   10256    8832    7174    6369    5289    4997    4647    3830 
##      CN      BL      AF      BJ      AG      DX      DY      BV       U 
##    3745    3570    2355    2345    2164    2095    1971    1899    1675 
##      DJ       N      DS       V      BX      DH      DF      AL      DR 
##    1660    1638    1513    1477    1387    1298    1119     966     932 
##      BK      BM       G      CS      DA      BQ      CP       R      AZ 
##     921     877     770     698     694     616     598     595     397 
##      BA      CM      AI       Y      AR      CI      AO       Z       A 
##     394     393     323     314     260     222     211     199     186 
##      AN      CC      CF      BH      DL       C      DM      BI      DK 
##     178     153     141     118     117     105     102      97      86 
##      DN      DZ      AK      CO      BS      BG      CH      BY      AA 
##      81      80      69      66      63      62      62      61      60 
##      BU      BF      BD      AY      CA      DG      AH      DE      AU 
##      57      54      41      34      31      26      25      18      17 
##      DU      DV      CB      AC       F      CL      CT      CW      DQ 
##      15      15      13      11      11       9       9       9       9 
##      CX       W      BP      BR      CQ      DW       Q      AV      CV 
##       7       7       6       6       6       6       6       5       5 
##      AT       E       B      BE      BN      DT      AB      AE (Other) 
##       4       4       3       3       3       3       2       2      34 
##    NA's 
##    6882 
## 
## $v66
##     A     B     C 
## 70353 18264 25704 
## 
## $v71
##     A     B     C     D     F     G     I     K     L 
##     1 30255  8947     1 75094     5    16     1     1 
## 
## $v74
##      A      B      C 
##     45 113560    716 
## 
## $v75
##     A     B     C     D 
##    18 39192    24 75087 
## 
## $v79
##     A     B     C     D     E     F     G     H     I     J     K     L 
##   417 25801 34561  5302 25257   571     6  2004  4561   933  4308     1 
##     M     N     O     P     Q     R 
##  3946    49  3331  2217  1006    50 
## 
## $v91
##     A     B     C     D     E     F     G  NA's 
## 27079 22683 23157   230  3206 13418 24545     3 
## 
## $v107
##     A     B     C     D     E     F     G  NA's 
## 13418 22683 24545 23157 27079  3206   230     3 
## 
## $v110
##     A     B     C 
## 55688 55426  3207 
## 
## $v112
##     A     B     C     D     E     F     G     H     I     J     K     L 
##  9545  2688  2055  7327  4748 21671  1702  5651 10224  3967  3252  4479 
##     M     N     O     P     Q     R     S     T     U     V  NA's 
##  1193  9086  3661  4675  2346  4170   833  3980  4803  1883   382 
## 
## $v113
##     A    AA    AB    AC    AD    AE    AF    AG    AH    AI    AJ    AK 
##   334     1  1273  5956   265   680  3568  1712   797   313   872     1 
##     B     C     D     E     F     G     H     I     J     L     M     N 
##  1358   707   142   324   233 16252    41  2605   322   477  7374  1186 
##     O     P     Q     R     S     T     U     V     W     X     Y     Z 
##   100  1975   594   181   614  1608   612  1673  1497  1635  1282   453 
##  NA's 
## 55304 
## 
## $v125
##    A   AA   AB   AC   AD   AE   AF   AG   AH   AI   AJ   AK   AL   AM   AN 
## 1528  542  189 1945  486  833 1116  931  343  905   82 5337 1031  444 1539 
##   AO   AP   AQ   AR   AS   AT   AU   AV   AW   AX   AY   AZ    B   BA   BB 
##  644 3410  469 2229  427 1114 1110  538  766   13 1042 2416 2374  349   68 
##   BC   BD   BE   BF   BG   BH   BI   BJ   BK   BL   BM   BN   BO   BP   BQ 
##  394 2452  864  511  355 1145  720 4465 1477 1214 5759  910  970  504  687 
##   BR   BS   BT   BU   BV   BW   BX   BY   BZ    C   CA   CB   CC   CD   CE 
##  642  240  407 1369  806 2478 1297 3311    3  514 1523  673 1082 2059 1164 
##   CF   CG   CH   CI   CJ   CK   CL    D    E    F    G    H    I    J    K 
##  927 3826  701  397 1446  623  461  969 2521  221 2594 3212  253  907 2835 
##    L    M    N    O    P    Q    R    S    T    U    V    W    X    Y    Z 
## 2502  636 1033  248 1463 1066 1217  809  615 1083 3234  888  867  880 1595 
## NA's 
##   77
#Get the number of levels for each categorical features
tmp <- train_fac %>% select(-c(ID, target))
level<- sapply(tmp,function(x)nlevels(x))
level
##    v3   v22   v24   v30   v31   v47   v52   v56   v66   v71   v74   v75 
##     3 18210     5     7     3    10    12   122     3     9     3     4 
##   v79   v91  v107  v110  v112  v113  v125 
##    18     7     7     3    22    36    90
#Overall missingness 
missing_fac <- getMissing(tmp)
table(missing_fac$n_Miss)
## 
##     0     3    77   382   500  3457  6882 55304 60110 
##     8     3     1     1     1     2     1     1     1
#Missingness for target = 1
missing_tgt1 <- getMissing(select(filter(train_fac, target == 1), -ID)) 
names(missing_tgt1) <- c("feature.names", "n_Miss_tgt1", "pct_Miss_tgt1")

#Join the tables 
missing_fac <- missing_fac %>% 
  inner_join(missing_tgt1)
## Warning in inner_join_impl(x, y, by$x, by$y): joining factors with
## different levels, coercing to character vector
slice(missing_fac, 1:10)
##    feature.names n_Miss     pct_Miss n_Miss_tgt1 pct_Miss_tgt1
## 1            v30  60110 0.5258001592       45881  5.272406e-01
## 2           v113  55304 0.4837606389       45564  5.235978e-01
## 3            v56   6882 0.0601989136        4284  4.922950e-02
## 4             v3   3457 0.0302394136        2780  3.194631e-02
## 5            v31   3457 0.0302394136        2780  3.194631e-02
## 6            v22    500 0.0043736496         410  4.711506e-03
## 7           v112    382 0.0033414683         299  3.435952e-03
## 8           v125     77 0.0006735420          64  7.354547e-04
## 9            v52      3 0.0000262419           3  3.447444e-05
## 10           v91      3 0.0000262419           3  3.447444e-05
#visualise the missing data 
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       v30 60110
##      v113 55304
##       v56  6882
##        v3  3457
##       v31  3457
##       v22   500
##      v112   382
##      v125    77
##       v52     3
##       v91     3
##      v107     3
##       v24     0
##       v47     0
##       v66     0
##       v71     0
##       v74     0
##       v75     0
##       v79     0
##      v110     0
#Target = 1
tmp <- train_fac %>% filter(target==1) %>% select(-c(ID, target))
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data target=1","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       v30 45881
##      v113 45564
##       v56  4284
##        v3  2780
##       v31  2780
##       v22   410
##      v112   299
##      v125    64
##       v52     3
##       v91     3
##      v107     3
##       v24     0
##       v47     0
##       v66     0
##       v71     0
##       v74     0
##       v75     0
##       v79     0
##      v110     0
rm(missing_tgt1, tmp)

We looked at the cross-tabulation of v30 and v113 with two numeric features, v102 and v8, to see whether a specific categorical value is coded when the numeric values are missing. There does not appear to be one, as every category contains both missing and non-missing numeric values.

#Visualise v30 and v113 with the numeric variables
feature.names <- t(select(filter(missing_num, pct_Miss > 0.4),feature.names))
feature.names <- c("v30", "v113", feature.names)
tmp <- train[feature.names]
aggr(tmp, col=c('light blue','red'), numbers=TRUE, combined=TRUE, sortVars=TRUE, labels=names(tmp), cex.axis=.7, gap=3, ylab=c("Pattern of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable Count
##       v30 60110
##      v113 55304
##      v102 51316
##       v85 50682
##      v119 50680
##       v51 50678
##      v123 50678
##       v23 50675
##       v16 49895
##       v69 49895
##       v78 49895
##      v115 49895
##      v131 49895
##        v9 49851
##       v80 49851
##      v122 49851
##       v19 49843
##       v37 49843
##       v92 49843
##       v95 49843
##       v97 49843
##      v118 49843
##      v130 49843
##       v20 49840
##       v65 49840
##      v121 49840
##       v11 49836
##       v15 49836
##       v39 49836
##       v43 49836
##       v53 49836
##       v58 49836
##       v68 49836
##       v73 49836
##       v90 49836
##      v100 49836
##      v116 49836
##      v120 49836
##        v1 49832
##        v6 49832
##        v7 49832
##       v13 49832
##       v18 49832
##       v26 49832
##       v27 49832
##       v28 49832
##       v29 49832
##       v32 49832
##       v33 49832
##       v35 49832
##       v41 49832
##       v42 49832
##       v45 49832
##       v49 49832
##       v55 49832
##       v57 49832
##       v60 49832
##       v67 49832
##       v77 49832
##       v83 49832
##       v84 49832
##       v86 49832
##       v88 49832
##       v93 49832
##       v94 49832
##       v96 49832
##       v99 49832
##      v103 49832
##      v104 49832
##      v111 49832
##      v126 49832
##      v127 49832
##        v2 49796
##        v4 49796
##       v17 49796
##       v44 49796
##       v48 49796
##       v59 49796
##       v61 49796
##       v64 49796
##       v76 49796
##      v101 49796
##      v106 49796
##       v87 48663
##      v105 48658
##       v98 48654
##       v70 48636
##        v5 48624
##       v36 48624
##       v81 48624
##       v82 48624
##      v108 48624
##      v109 48624
##      v117 48624
##      v128 48624
##        v8 48619
##       v25 48619
##       v46 48619
##       v54 48619
##       v63 48619
##       v89 48619
##      v124 48619
#md <- md.pattern(tmp)
#md <- data.frame(obs=as.numeric(rownames(md)),md) %>% 
#    arrange(-obs) 
#filter(md, obs > 1000 | is.na(obs))
#md.pairs(tmp)$mm

#Proportion of missingess by categorical values
feature.names <- c("v30","v113","v102","v8")
tmp <- train[feature.names]
tmp <- tmp %>% mutate(v30=ifelse(is.na(v30),'99',v30), 
                     v113=ifelse(is.na(v113),'99',v113),
                    miss102=ifelse(is.na(v102),1,0),
                     miss8=ifelse(is.na(v8),1,0))
table(tmp$v30,tmp$miss102)
##     
##          0     1
##   1   1865   448
##   2    143    62
##   3  22667  9511
##   4   4052  1173
##   5   2558   415
##   6   1912   677
##   7   6311  2417
##   99 23497 36613
table(tmp$v30,tmp$miss8)
##     
##          0     1
##   1   1908   405
##   2    147    58
##   3  23443  8735
##   4   4217  1008
##   5   2650   323
##   6   1983   606
##   7   6609  2119
##   99 24745 35365
table(tmp$v113,tmp$miss102)
##     
##          0     1
##   1    168   166
##   10   134   179
##   11   447   425
##   12     0     1
##   13   683   675
##   14   377   330
##   15    44    98
##   16   154   170
##   17   113   120
##   18  8764  7488
##   19    22    19
##   2      0     1
##   20  1444  1161
##   21   175   147
##   22   260   217
##   23  2190  5184
##   24   928   258
##   25    61    39
##   26  1116   859
##   27   320   274
##   28    99    82
##   29   331   283
##   3    639   634
##   30   432  1176
##   31   347   265
##   32   858   815
##   33   829   668
##   34   970   665
##   35   644   638
##   36   222   231
##   4   3135  2821
##   5    157   108
##   6    274   406
##   7   2038  1530
##   8    908   804
##   9    403   394
##   99 33319 21985
table(tmp$v113,tmp$miss8)
##     
##          0     1
##   1    173   161
##   10   148   165
##   11   472   400
##   12     0     1
##   13   713   645
##   14   401   306
##   15    47    95
##   16   158   166
##   17   116   117
##   18  9257  6995
##   19    23    18
##   2      0     1
##   20  1519  1086
##   21   191   131
##   22   269   208
##   23  2257  5117
##   24   970   216
##   25    63    37
##   26  1147   828
##   27   335   259
##   28   101    80
##   29   355   259
##   3    680   593
##   30   458  1150
##   31   359   253
##   32   919   754
##   33   848   649
##   34  1010   625
##   35   681   601
##   36   225   228
##   4   3242  2714
##   5    159   106
##   6    283   397
##   7   2156  1412
##   8    969   743
##   9    421   376
##   99 34577 20727
#prop.table(table(tmp$v30,tmp$v113))

rm(tmp, feature.names)

Distribution of the categorical features

Most of the categorical features, except perhaps v52 and v107, may be useful for prediction. We also note that the proportion of target = 1 is larger in the missing-value category of v30 and v113 than in the non-missing categories, so the missingness of a categorical feature is itself predictive.

For categorical features, our strategy is to replace missing values with a new categorical value.
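A minimal sketch of this missing-as-category idea on a toy factor. The level label "MISSING" here is illustrative only; in the imputation step below the categories are mapped to integers and NA is coded as -1.

```r
# Turn NA into an explicit factor level so missingness becomes its own category
x <- factor(c("A", NA, "B", NA, "A"))
x <- addNA(x)                            # adds NA as an explicit level
levels(x)[is.na(levels(x))] <- "MISSING" # rename the NA level
table(x)
```

After this, models that accept factors see the missingness as an ordinary category rather than dropping those rows.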

#Convert target to factor and reverse order
train_fac$target <- factor(train_fac$target, levels = rev(levels(as.factor(train_fac$target))))

lvls <- names(level[level <= 50])
out <- NULL 
for(i in 1:length(lvls)){
   df <- train_fac[c("target",lvls[i])]
   out[[i]] <- ggplot(df, aes_string(colnames(df)[2], 
               fill = colnames(df)[1])) + geom_bar() 
}

grid.arrange(out[[1]], out[[2]], out[[3]], ncol = 3)

grid.arrange(out[[4]], out[[5]], out[[6]], ncol = 3)

grid.arrange(out[[7]], out[[8]], out[[9]], ncol = 3)

grid.arrange(out[[10]], out[[11]], out[[12]], ncol = 3)

grid.arrange(out[[13]], out[[14]], out[[15]], ncol = 3)

grid.arrange(out[[16]], ncol = 1)

rm(lvls,out,df,i)

train %>% select(target, v30) %>% 
  mutate(miss=ifelse(is.na(v30),1,0)) %>% 
  group_by(miss) %>% 
  summarize(mean(target))
## Source: local data frame [2 x 2]
## 
##    miss mean(target)
##   (dbl)        (dbl)
## 1     0    0.7588866
## 2     1    0.7632840
train %>% select(target, v113) %>% 
  mutate(miss=ifelse(is.na(v113),1,0)) %>% 
  group_by(miss) %>% 
  summarize(mean(target))
## Source: local data frame [2 x 2]
## 
##    miss mean(target)
##   (dbl)        (dbl)
## 1     0    0.7024586
## 2     1    0.8238825

Imputation of Missing Values

We impute missing values for features with less than 5% missing. We also recode categorical values with low counts into the category with the highest count.

  1. v14 - replace negative values with 0.

  2. Convert categorical values to integers, replacing missing values with -1.

  3. Impute missing values for numeric features with less than 5% missing values.

  4. Add an indicator (vComplete) for complete cases (observations without missing values) and, for each feature, an indicator of whether its value is present or missing. Replace missing values with -1.

  5. Recode categorical values with low counts.

#v14 - change values < 0 to 0
train_Imp <- mutate(train, v14=ifelse(v14<0,0,v14)) %>% select(-ID)

#Replace categorical features with integers and missing as -1
fac <- sapply(train_Imp, is.factor)
train_Imp[fac] <- lapply(train_Imp[fac],function(x){ifelse(is.na(x),as.integer(-1),as.integer(x))})

#Impute missing values for numeric features v10,v12,v14,v21,v34,v40, v50 and v114
#Use information for the categorical features as well
feature.names <- names(select(train_fac,-ID))

feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
summary(train_Imp[feature.names])
##      target             v3              v22             v24       
##  Min.   :0.0000   Min.   :-1.000   Min.   :   -1   Min.   :1.000  
##  1st Qu.:1.0000   1st Qu.: 3.000   1st Qu.: 4363   1st Qu.:3.000  
##  Median :1.0000   Median : 3.000   Median : 9340   Median :4.000  
##  Mean   :0.7612   Mean   : 2.875   Mean   : 9177   Mean   :4.058  
##  3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:13638   3rd Qu.:5.000  
##  Max.   :1.0000   Max.   : 3.000   Max.   :18210   Max.   :5.000  
##                                                                   
##       v30              v31              v47              v52        
##  Min.   :-1.000   Min.   :-1.000   Min.   : 1.000   Min.   :-1.000  
##  1st Qu.:-1.000   1st Qu.: 1.000   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median :-1.000   Median : 1.000   Median : 4.000   Median : 7.000  
##  Mean   : 1.326   Mean   : 1.168   Mean   : 5.606   Mean   : 6.558  
##  3rd Qu.: 3.000   3rd Qu.: 1.000   3rd Qu.: 9.000   3rd Qu.:10.000  
##  Max.   : 7.000   Max.   : 3.000   Max.   :10.000   Max.   :12.000  
##                                                                     
##       v56              v66             v71            v74       
##  Min.   : -1.00   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.: 22.00   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:2.000  
##  Median : 51.00   Median :1.000   Median :5.00   Median :2.000  
##  Mean   : 58.59   Mean   :1.609   Mean   :4.05   Mean   :2.006  
##  3rd Qu.: 88.00   3rd Qu.:2.000   3rd Qu.:5.00   3rd Qu.:2.000  
##  Max.   :122.00   Max.   :3.000   Max.   :9.00   Max.   :3.000  
##                                                                 
##       v75             v79              v91              v107       
##  Min.   :1.000   Min.   : 1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median :4.000   Median : 3.000   Median : 3.000   Median : 3.000  
##  Mean   :3.314   Mean   : 5.038   Mean   : 3.597   Mean   : 3.335  
##  3rd Qu.:4.000   3rd Qu.: 5.000   3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :4.000   Max.   :18.000   Max.   : 7.000   Max.   : 7.000  
##                                                                    
##       v110            v112             v113             v125      
##  Min.   :1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.00  
##  1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:-1.000   1st Qu.:26.00  
##  Median :2.000   Median : 9.000   Median : 4.000   Median :46.00  
##  Mean   :1.541   Mean   : 9.685   Mean   : 8.886   Mean   :46.72  
##  3rd Qu.:2.000   3rd Qu.:14.000   3rd Qu.:18.000   3rd Qu.:69.00  
##  Max.   :3.000   Max.   :22.000   Max.   :36.000   Max.   :90.00  
##                                                                   
##       v10              v12              v14             v21         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 0.1062  
##  1st Qu.: 1.050   1st Qu.: 6.322   1st Qu.:11.26   1st Qu.: 6.4155  
##  Median : 1.313   Median : 6.613   Median :11.97   Median : 7.0454  
##  Mean   : 1.883   Mean   : 6.881   Mean   :12.09   Mean   : 7.0297  
##  3rd Qu.: 2.101   3rd Qu.: 7.020   3rd Qu.:12.72   3rd Qu.: 7.6706  
##  Max.   :18.534   Max.   :18.711   Max.   :20.00   Max.   :19.2961  
##  NA's   :84       NA's   :86       NA's   :4       NA's   :611      
##       v34              v40              v50               v114      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.00  
##  1st Qu.: 5.054   1st Qu.: 8.408   1st Qu.: 0.6588   1st Qu.:12.00  
##  Median : 6.537   Median :10.334   Median : 1.2119   Median :14.04  
##  Mean   : 6.406   Mean   :10.466   Mean   : 1.5043   Mean   :13.57  
##  3rd Qu.: 7.703   3rd Qu.:12.765   3rd Qu.: 2.0072   3rd Qu.:15.37  
##  Max.   :20.000   Max.   :20.000   Max.   :20.0000   Max.   :20.00  
##  NA's   :111      NA's   :111      NA's   :86        NA's   :30
set.seed(12345)
imputed = complete(mice(train_Imp[feature.names]))
## 
##  iter imp variable
##   1   1  v10  v12  v14  v21  v34  v40  v50  v114
##   1   2  v10  v12  v14  v21  v34  v40  v50  v114
##   1   3  v10  v12  v14  v21  v34  v40  v50  v114
##   1   4  v10  v12  v14  v21  v34  v40  v50  v114
##   1   5  v10  v12  v14  v21  v34  v40  v50  v114
##   2   1  v10  v12  v14  v21  v34  v40  v50  v114
##   2   2  v10  v12  v14  v21  v34  v40  v50  v114
##   2   3  v10  v12  v14  v21  v34  v40  v50  v114
##   2   4  v10  v12  v14  v21  v34  v40  v50  v114
##   2   5  v10  v12  v14  v21  v34  v40  v50  v114
##   3   1  v10  v12  v14  v21  v34  v40  v50  v114
##   3   2  v10  v12  v14  v21  v34  v40  v50  v114
##   3   3  v10  v12  v14  v21  v34  v40  v50  v114
##   3   4  v10  v12  v14  v21  v34  v40  v50  v114
##   3   5  v10  v12  v14  v21  v34  v40  v50  v114
##   4   1  v10  v12  v14  v21  v34  v40  v50  v114
##   4   2  v10  v12  v14  v21  v34  v40  v50  v114
##   4   3  v10  v12  v14  v21  v34  v40  v50  v114
##   4   4  v10  v12  v14  v21  v34  v40  v50  v114
##   4   5  v10  v12  v14  v21  v34  v40  v50  v114
##   5   1  v10  v12  v14  v21  v34  v40  v50  v114
##   5   2  v10  v12  v14  v21  v34  v40  v50  v114
##   5   3  v10  v12  v14  v21  v34  v40  v50  v114
##   5   4  v10  v12  v14  v21  v34  v40  v50  v114
##   5   5  v10  v12  v14  v21  v34  v40  v50  v114
train_Imp[feature.names] = imputed[feature.names]
#check that they are not missing anymore
summary(train_Imp[feature.names])
##      target             v3              v22             v24       
##  Min.   :0.0000   Min.   :-1.000   Min.   :   -1   Min.   :1.000  
##  1st Qu.:1.0000   1st Qu.: 3.000   1st Qu.: 4363   1st Qu.:3.000  
##  Median :1.0000   Median : 3.000   Median : 9340   Median :4.000  
##  Mean   :0.7612   Mean   : 2.875   Mean   : 9177   Mean   :4.058  
##  3rd Qu.:1.0000   3rd Qu.: 3.000   3rd Qu.:13638   3rd Qu.:5.000  
##  Max.   :1.0000   Max.   : 3.000   Max.   :18210   Max.   :5.000  
##       v30              v31              v47              v52        
##  Min.   :-1.000   Min.   :-1.000   Min.   : 1.000   Min.   :-1.000  
##  1st Qu.:-1.000   1st Qu.: 1.000   1st Qu.: 3.000   1st Qu.: 4.000  
##  Median :-1.000   Median : 1.000   Median : 4.000   Median : 7.000  
##  Mean   : 1.326   Mean   : 1.168   Mean   : 5.606   Mean   : 6.558  
##  3rd Qu.: 3.000   3rd Qu.: 1.000   3rd Qu.: 9.000   3rd Qu.:10.000  
##  Max.   : 7.000   Max.   : 3.000   Max.   :10.000   Max.   :12.000  
##       v56              v66             v71            v74       
##  Min.   : -1.00   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.: 22.00   1st Qu.:1.000   1st Qu.:2.00   1st Qu.:2.000  
##  Median : 51.00   Median :1.000   Median :5.00   Median :2.000  
##  Mean   : 58.59   Mean   :1.609   Mean   :4.05   Mean   :2.006  
##  3rd Qu.: 88.00   3rd Qu.:2.000   3rd Qu.:5.00   3rd Qu.:2.000  
##  Max.   :122.00   Max.   :3.000   Max.   :9.00   Max.   :3.000  
##       v75             v79              v91              v107       
##  Min.   :1.000   Min.   : 1.000   Min.   :-1.000   Min.   :-1.000  
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median :4.000   Median : 3.000   Median : 3.000   Median : 3.000  
##  Mean   :3.314   Mean   : 5.038   Mean   : 3.597   Mean   : 3.335  
##  3rd Qu.:4.000   3rd Qu.: 5.000   3rd Qu.: 6.000   3rd Qu.: 5.000  
##  Max.   :4.000   Max.   :18.000   Max.   : 7.000   Max.   : 7.000  
##       v110            v112             v113             v125      
##  Min.   :1.000   Min.   :-1.000   Min.   :-1.000   Min.   :-1.00  
##  1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:-1.000   1st Qu.:26.00  
##  Median :2.000   Median : 9.000   Median : 4.000   Median :46.00  
##  Mean   :1.541   Mean   : 9.685   Mean   : 8.886   Mean   :46.72  
##  3rd Qu.:2.000   3rd Qu.:14.000   3rd Qu.:18.000   3rd Qu.:69.00  
##  Max.   :3.000   Max.   :22.000   Max.   :36.000   Max.   :90.00  
##       v10                 v12                 v14             v21         
##  Min.   :-0.000001   Min.   : 0.000001   Min.   : 0.00   Min.   : 0.1062  
##  1st Qu.: 1.050328   1st Qu.: 6.322328   1st Qu.:11.26   1st Qu.: 6.4165  
##  Median : 1.312910   Median : 6.612881   Median :11.97   Median : 7.0481  
##  Mean   : 1.882962   Mean   : 6.881225   Mean   :12.09   Mean   : 7.0345  
##  3rd Qu.: 2.100657   3rd Qu.: 7.019932   3rd Qu.:12.72   3rd Qu.: 7.6764  
##  Max.   :18.533920   Max.   :18.710550   Max.   :20.00   Max.   :19.2961  
##       v34                 v40              v50                 v114      
##  Min.   :-0.000001   Min.   : 0.000   Min.   :-0.000001   Min.   : 0.00  
##  1st Qu.: 5.054449   1st Qu.: 8.408   1st Qu.: 0.658792   1st Qu.:12.00  
##  Median : 6.537180   Median :10.334   Median : 1.211944   Median :14.04  
##  Mean   : 6.406581   Mean   :10.465   Mean   : 1.504475   Mean   :13.57  
##  3rd Qu.: 7.702727   3rd Qu.:12.764   3rd Qu.: 2.007189   3rd Qu.:15.37  
##  Max.   :20.000000   Max.   :20.000   Max.   :20.000000   Max.   :20.00
#Add indicator for each feature to indicate observation with missing value, and complete cases
feature.names <- names(select(train_num,-ID))
tmp <- data.frame(vComplete = as.integer(complete.cases(train_Imp))) 
train_Imp <- bind_cols(train_Imp, tmp) 

feature.names <- names(select(train_num,-c(ID,target,v10,v12,v14,v21,v34,v40,v50,v114)))
for(i in 1:length(feature.names)) {
    train_Imp[[paste(feature.names[i],"Miss", sep="_")]] <- apply(train_Imp[feature.names[i]],2,function(x){ifelse(is.na(x),1,0)})
}

#Recode NA to -1
train_Imp[feature.names] <- lapply(train_Imp[feature.names], function(x){ifelse(is.na(x),-1,x)})

#Recode categories with low counts
train_Imp$v47 <- ifelse(train_Imp$v47 %in% c(8), as.integer(9), train_Imp$v47)
train_Imp$v52 <- ifelse(train_Imp$v52 %in% c(-1), as.integer(10), train_Imp$v52)
train_Imp$v71 <- ifelse(train_Imp$v71 %in% c(1,4,6,7,8,9), as.integer(5), train_Imp$v71) 
train_Imp$v79 <- ifelse(train_Imp$v79 %in% c(1,6,7,8,10,12,14,17,18), as.integer(19), train_Imp$v79)
train_Imp$v91 <- ifelse(train_Imp$v91 %in% c(-1), as.integer(1), train_Imp$v91)
train_Imp$v113 <- ifelse(train_Imp$v113 %in% c(2,12,19), as.integer(18), train_Imp$v113)

rm(imputed, fac, tmp, feature.names,i)

Correlation

We looked at the pairwise associations between the categorical features and the correlations between the numeric features to see whether we can reduce the feature set. We want features that are independent of each other but on which the target variable depends.

We measure the strength of association between categorical features using the Cramér's V statistic from the vcd package. A number of feature pairs are strongly associated (> 0.7): the relationship is perfect between v91 and v107, essentially perfect between v47 and v110, and very strong between v79 and v110. Cross-tabulating these pairs shows that one feature in each pair carries the same information, so we drop the redundant variables v110 and v107.
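For reference, Cramér's V for an r-by-c contingency table is V = sqrt(chi² / (n · (min(r, c) − 1))), which is what assocstats computes. A self-contained check on a toy table where one variable perfectly determines the other:

```r
# Cramér's V computed directly from the chi-squared statistic;
# x determines y here, so V should come out as exactly 1.
tab <- table(
  x = c("A", "A", "B", "B", "A", "B"),
  y = c("P", "P", "Q", "Q", "P", "Q")
)
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
V <- sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
V  # 1
```

V ranges from 0 (no association) to 1 (one variable determines the other), which is why the perfect pairs below can be dropped without losing information.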

out = data.frame(var1=character(), 
                 var2=character(), 
                 cramerV=double())

#Exclude features with a high number of levels
feature.names <- names(select(train_fac,-c(ID,v22,v56,v125)))
tmp <- train_Imp[feature.names] 

for(i in 1:ncol(tmp)){
  for(j in 1:ncol(tmp))  
     if (i != j){
        tbl <- select(tmp,c(i,j))
        xtab <- xtabs(~., tbl)
        newrow = data.frame(var1=names(tmp[,i]), 
                            var2=names(tmp[,j]),
                            cramerV=assocstats(xtab)$cramer)
        out <- rbind(out, newrow)
    }
} 

out %>% filter(cramerV > 0.7) %>% arrange(-cramerV)
##    var1 var2   cramerV
## 1   v91 v107 1.0000000
## 2  v107  v91 1.0000000
## 3   v47 v110 0.9999910
## 4  v110  v47 0.9999910
## 5   v79 v110 0.9261114
## 6  v110  v79 0.9261114
## 7   v47  v79 0.8175226
## 8   v79  v47 0.8175226
## 9   v31 v110 0.7689007
## 10 v110  v31 0.7689007
## 11 v113 v110 0.7515573
## 12 v110 v113 0.7515573
## 13  v71  v75 0.7069755
## 14  v75  v71 0.7069755
table(tmp$v47,tmp$v110)   #Drop v110: same information as v47 (e.g. 2 in v110 maps to 3 in v47)
##     
##          1     2     3
##   1     38     0     0
##   2      0     0    50
##   3      0 55425     0
##   4      0     0  3157
##   5   5301     0     0
##   6   4322     0     0
##   7   3946     0     0
##   9  39071     1     0
##   10  3010     0     0
table(tmp$v79,tmp$v110)   #Drop v110: largely the same information as v79
##     
##          1     2     3
##   2      0 25801     0
##   3  34561     0     0
##   4   5302     0     0
##   5      0 25257     0
##   9   4561     0     0
##   11  4308     0     0
##   13  3946     0     0
##   15     0  3331     0
##   16     0     0  2217
##   19  3010  1037   990
table(tmp$v91,tmp$v107)   #Drop v107 
##    
##        -1     1     2     3     4     5     6     7
##   1     3     0     0     0     0 27079     0     0
##   2     0     0 22683     0     0     0     0     0
##   3     0     0     0     0 23157     0     0     0
##   4     0     0     0     0     0     0     0   230
##   5     0     0     0     0     0     0  3206     0
##   6     0 13418     0     0     0     0     0     0
##   7     0     0     0 24545     0     0     0     0
rm(feature.names, tmp, out)

We use the corrplot package to visualise the correlation matrix for the numeric features. We first looked at the correlation between the numeric features with no missing values, using all the data. The correlogram shows that some of the features are highly correlated: there are 5 pairs of features with correlations above 0.7 in absolute value.

#Correlation between non-missing numeric features
feature.names <- c("v10","v12","v14","v21","v34","v40","v50","v114")
tmp <- train_Imp[feature.names]
cor <- cor(tmp, use="everything")
col <- colorRampPalette(c("red", "white", "blue"))(10)
corrplot(cor, type="lower", order="hclust", col=col, tl.cex=1, tl.col="black", tl.srt=0)

cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor)) 
cor <- cor %>% 
          filter(!is.na(Freq)) %>% 
          arrange(Var1, Var2)

filter(cor, abs(Freq) > 0.7) 
##   Var1 Var2       Freq
## 1  v10  v12  0.9118438
## 2  v14  v21  0.8113106
## 3  v34  v40 -0.9030706
## 4  v34 v114  0.9116072
## 5  v40 v114 -0.9662656

A plot on a random sample of the data shows some interesting patterns and relationships between the features in the scatter plots. v10 appears to take values at discrete intervals, and there are unusual patterns between v34 and v40, v12 and v50, and v40 and v114.

#take random subset of 5000 observations 
tmp <- train_Imp[c("target",feature.names)] %>% 
              mutate(target=factor(target)) %>%
              sample_n(5000) 

#Plot all pairwise features
out <- NULL
x <- as.character(cor$Var1)
y <- as.character(cor$Var2)
for(i in 1:nrow(cor)){
   df <- tmp[c("target",x[i],y[i])] 
   out[[i]] <- ggplot(df, aes_string(x=colnames(df)[2], 
                                     y=colnames(df)[3],
                                     fill="target")) +
                geom_point(cex=3, pch=21)  
}
grid.arrange(out[[1]], out[[2]], out[[3]], out[[4]], ncol = 4)

grid.arrange(out[[5]], out[[6]], out[[7]], out[[8]], ncol = 4)

grid.arrange(out[[9]], out[[10]], out[[11]], out[[12]], ncol = 4)

grid.arrange(out[[13]], out[[14]], out[[15]], out[[16]], ncol = 4)

grid.arrange(out[[17]], out[[18]], out[[19]], out[[20]], ncol = 4)

grid.arrange(out[[21]], out[[22]], out[[23]], out[[24]], ncol = 4)

grid.arrange(out[[25]], out[[26]], out[[27]], out[[28]], ncol = 4)

rm(feature.names,tmp,cor,col,out,x,y,df)

Correlation between all numeric features

We then looked at the correlation between all numeric features using the complete cases. The correlogram again shows some highly correlated features. There are also clusters of features that are positively correlated with each other but negatively correlated with another group of features.

There are 61 pairs of features with absolute correlation greater than 0.9. Some features appear numerous times (v25 five times; v8, v29, v33 and v46 four times each), which suggests that they are linked.

For prediction of the target value, we want to choose numeric features that are correlated with the target but not with each other.
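An alternative to reading pairs off the correlogram is caret's `findCorrelation()`, which suggests a set of columns to remove so that no remaining pair exceeds a cutoff. A sketch, assuming `tmp` is a complete-case numeric feature matrix like the one constructed in the next chunk:

```r
# Suggest columns to drop so no remaining pairwise |cor| exceeds 0.9
cor_mat  <- cor(tmp, use = "everything")
drop_idx <- findCorrelation(cor_mat, cutoff = 0.9)
names(tmp)[drop_idx]   # candidate features to drop
```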

# ALL numeric variables
feature.names <- names(select(train_num,-c(ID,target)))
tmp <- filter(train_Imp,vComplete == 1)[feature.names] 
cor <- cor(tmp, use="everything")
col<- colorRampPalette(c("red", "white", "blue"))(10)
corrplot(cor, type="lower", order="hclust", method="color", col=col, tl.cex=0.4, tl.col="black", tl.srt=0)

cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor)) 
cor <- cor %>% 
          filter(!is.na(Freq)) %>% 
          arrange(Freq)

#correlation greater than 0.9
high_cor <- filter(cor, abs(Freq) > 0.9)  
slice(high_cor, 1:10)
##    Var1 Var2       Freq
## 1   v58 v100 -0.9976676
## 2   v69 v115 -0.9936711
## 3   v48 v106 -0.9826985
## 4   v40 v114 -0.9671537
## 5   v48  v64 -0.9648004
## 6   v39  v68 -0.9473669
## 7   v17  v48 -0.9129205
## 8    v5  v81 -0.9053241
## 9   v55  v83  0.9004106
## 10  v83 v111  0.9019186
#Count number of times a feature is correlated with another
mutate(high_cor, Freq=1) %>% group_by(Var1) %>% summarise(count=sum(Freq)) %>% filter(count > 1) %>% arrange(-count)
## Source: local data frame [15 x 2]
## 
##      Var1 count
##    (fctr) (dbl)
## 1     v25     5
## 2      v8     4
## 3     v29     4
## 4     v33     4
## 5     v46     4
## 6     v17     3
## 7     v41     3
## 8     v54     3
## 9     v15     2
## 10    v26     2
## 11    v32     2
## 12    v48     2
## 13    v63     2
## 14    v64     2
## 15    v83     2
#take random subset of data
tmp <- train_Imp[c("target","vComplete",feature.names)] %>% 
            filter(vComplete==1) %>%
            mutate(target=factor(target)) %>%
            sample_n(5000) 
#Plot random sample of 28 pairs
cor <- cor %>% sample_n(28) 
out <- NULL
x <- as.character(cor$Var1)
y <- as.character(cor$Var2)
for(i in 1:nrow(cor)){
   df <- tmp[c("target",x[i],y[i])] 
   out[[i]] <- ggplot(df, aes_string(x=colnames(df)[2], 
                                     y=colnames(df)[3],
                                     fill="target")) +
                geom_point(cex=3, pch=21)  
}
grid.arrange(out[[1]], out[[2]], out[[3]], out[[4]], ncol = 4)

grid.arrange(out[[5]], out[[6]], out[[7]], out[[8]], ncol = 4)

grid.arrange(out[[9]], out[[10]], out[[11]], out[[12]], ncol = 4)

grid.arrange(out[[13]], out[[14]], out[[15]], out[[16]], ncol = 4)

grid.arrange(out[[17]], out[[18]], out[[19]], out[[20]], ncol = 4)

grid.arrange(out[[21]], out[[22]], out[[23]], out[[24]], ncol = 4)

grid.arrange(out[[25]], out[[26]], out[[27]], out[[28]], ncol = 4)

#Correlations between Target and numeric variables
feature.names <- names(select(train_num,-c(ID)))
tmp <- filter(train_Imp,vComplete == 1)[feature.names] 
cor <- cor(tmp, use="everything")
cor[lower.tri(cor,diag=TRUE)] <- NA
cor <- as.data.frame(as.table(cor)) 
#Retain features whose absolute correlation with the target exceeds 0.06
cor <- cor %>% 
          filter(!is.na(Freq), Var1=="target", abs(Freq) > 0.06) %>% arrange(-abs(Freq))
slice(cor, 1:10)
##      Var1 Var2        Freq
## 1  target  v50  0.23244609
## 2  target  v10  0.14635601
## 3  target v129  0.14476307
## 4  target  v14  0.14422416
## 5  target  v34  0.09390153
## 6  target  v62 -0.09282448
## 7  target  v21  0.08204274
## 8  target  v72  0.07654414
## 9  target v114  0.06824029
## 10 target  v38  0.06265813
numeric.features <- as.character(t(cor$Var2))
rm(feature.names,tmp,cor,col,high_cor,out,x,y,df)

Prediction

We split the training data provided into an 80% training set and a 20% test set to assess model accuracy.

1. Baseline Model

We know from the data that 76.1% of the claims were classified as suitable for accelerated approval, so our baseline model simply predicts the probability of target = 1 as the proportion of target = 1 in the training data. This baseline scores a log loss of 0.5444 on the test set.

#Function to evaluate the log loss metric
logLoss <- function(true_target, pred_prob){
    eps <- 1e-15
    pred_prob[pred_prob < eps] <- eps
    pred_prob[pred_prob > 1-eps] <- 1-eps
    out <- -1/length(true_target) * sum(true_target*log(pred_prob) + 
            (1-true_target)*log(1-(pred_prob)))
    names(out) <- "logLoss"
    out
}

#Split into training and test set (80:20)
set.seed(3234)  # for reproducibility
inTrain <- createDataPartition(y = train_Imp$target, p=0.8)
train_set <- slice(train_Imp, inTrain$Resample1)
test_set <- slice(train_Imp, -inTrain$Resample1)

#1. BaseLine prediction - Mean
mu <- mean(train_set$target)
pred <- rep(mu, nrow(test_set))
naive_loss<- logLoss(test_set$target, pred)
naive_loss
##   logLoss 
## 0.5444054
logLoss_results <- data_frame(Method = "Average", logLoss = naive_loss)

2. Logistic Regression, Model 1

We fit a logistic regression using the features without missing values: the categorical variables (excluding v22, v56, v125, v107 and v110) and the numeric features v10, v12, v14, v21, v34, v40, v50 and v114. This improved the log loss to 0.4872.

library("rpart")
## Warning: package 'rpart' was built under R version 3.2.5
library("rpart.plot")
## Warning: package 'rpart.plot' was built under R version 3.2.5
#Logistic Regression, Model 1
feature.names <- names(select(train_fac,-c(ID,target,v22,v56, v125,v107,v110)))
feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
train_glm <- train_set[c("target",feature.names)]
test_glm <- test_set[c("target",feature.names)]

glm_fit1 <- glm(target ~., data=train_glm, family="binomial")
pred1 = predict(glm_fit1, newdata = test_glm, type = "response")
glm_loss1 <- logLoss(test_glm$target, pred1)
glm_loss1
##   logLoss 
## 0.4872247
logLoss_results <- bind_rows(logLoss_results,
                          data_frame(Method="Logit Model 1",  
                                     logLoss = glm_loss1))
rm(train_glm, test_glm)

3. Logistic Regression, Model 2

We added numeric features we thought might be predictive to our logistic model, together with the dummy variables for missing values. This improved the log loss slightly to 0.4868.

feature.names <- names(select(train_fac,-c(ID,target,v22,v56, v125,v107,v110)))
feature.names <- c(feature.names, "v10","v12","v14","v21","v34","v40","v50","v114")
numeric.features <- c("v2","v4","v28","v36","v44","v54","v61","v63","v64","v81","v87","v100","v106","v98","v119","v123","v129")  

dummy.features = NULL
for(i in 1:length(numeric.features)) {
    dummy.features <- c(dummy.features, paste(numeric.features[i],"Miss", sep="_"))
}  
feature.names <- unique(c(feature.names, numeric.features, dummy.features))

train_glm <- train_set[c("target",feature.names)]
test_glm <- test_set[c("target",feature.names)]

glm_fit <- glm(target ~., data=train_glm, family="binomial")
pred2 = predict(glm_fit, newdata = test_glm, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
glm_loss2 <- logLoss(test_glm$target, pred2)
glm_loss2
##   logLoss 
## 0.4868249
logLoss_results <- bind_rows(logLoss_results,
                          data_frame(Method="Logit Model 2",  
                                     logLoss = glm_loss2))
rm(train_glm, test_glm)

4. Tree-Based Models

We tried two tree-based models - a decision tree and a random forest.

For the classification tree, we used the rpart package, choosing only the features with no missing values, as in logistic Model 1. We pruned the tree to avoid over-fitting. The first split is on v50, which we saw previously has the highest correlation with the target variable. We get a log loss of 0.4975, which is not as good as the logistic models.

feature.names <- names(select(train_fac,-c(ID,target,v22,v56,v125,v107,v110)))
feature.names <- c(feature.names,"v10","v12","v14","v21","v34","v40","v50","v114")
train_cart <- train_set[c("target", feature.names)]
test_cart <- test_set[c("target", feature.names)]

#Convert categorical features into factor
feature.names <- names(select(train_fac,-c(ID, v22,v56, v125,v107,v110)))
train_cart[feature.names] <- lapply(train_cart[feature.names], as.factor)
test_cart[feature.names] <- lapply(test_cart[feature.names], as.factor)

#Fit the model
cart_fit = rpart(target ~ ., data = train_cart,method="class", control = rpart.control(minsplit=20, cp = 0.0005))

# plot the tree 
printcp(cart_fit)
## 
## Classification tree:
## rpart(formula = target ~ ., data = train_cart, method = "class", 
##     control = rpart.control(minsplit = 20, cp = 5e-04))
## 
## Variables actually used in tree construction:
##  [1] v10  v112 v113 v114 v12  v21  v24  v30  v31  v34  v40  v47  v50  v52 
## [15] v66  v79  v91 
## 
## Root node error: 21945/91457 = 0.23995
## 
## n= 91457 
## 
##            CP nsplit rel error  xerror      xstd
## 1  0.01480975      0   1.00000 1.00000 0.0058851
## 2  0.00546822      3   0.95557 0.96031 0.0058031
## 3  0.00248348      6   0.93917 0.94094 0.0057616
## 4  0.00209615      8   0.93420 0.93862 0.0057566
## 5  0.00179236      9   0.93210 0.93739 0.0057539
## 6  0.00173160     12   0.92673 0.93128 0.0057406
## 7  0.00132149     16   0.91980 0.92882 0.0057352
## 8  0.00129870     17   0.91848 0.92987 0.0057375
## 9  0.00127592     19   0.91588 0.92951 0.0057367
## 10 0.00100251     20   0.91460 0.92832 0.0057341
## 11 0.00097972     22   0.91260 0.92709 0.0057313
## 12 0.00088859     26   0.90823 0.92700 0.0057311
## 13 0.00082023     28   0.90645 0.92673 0.0057305
## 14 0.00077466     31   0.90399 0.92677 0.0057306
## 15 0.00069872     33   0.90244 0.92618 0.0057293
## 16 0.00068353     36   0.90034 0.92622 0.0057294
## 17 0.00061517     40   0.89761 0.92750 0.0057322
## 18 0.00059239     43   0.89565 0.92627 0.0057295
## 19 0.00057720     44   0.89506 0.92714 0.0057314
## 20 0.00054682     47   0.89332 0.92736 0.0057319
## 21 0.00052404     48   0.89278 0.92736 0.0057319
## 22 0.00050125     50   0.89173 0.92691 0.0057309
## 23 0.00050000     51   0.89123 0.92718 0.0057315
plotcp(cart_fit)

par(mfrow=c(1,1), mar=c(1,1,1,1))
plot(cart_fit, uniform=T, compress=T, margin=0.1, branch=0.3)
text(cart_fit, use.n=T, digits=3, cex=0.6)

# prune the tree to the cp that minimises the error
pfit<- prune(cart_fit, cp= cart_fit$cptable[which.min(cart_fit$cptable[,"xerror"]),"CP"]) 
# plot the pruned tree 
par(mfrow=c(1,1), mar=c(1,1,1,1))
plot(pfit, uniform=T, compress=T, margin=0.1, branch=0.3)
text(pfit, use.n=T, digits=3, cex=0.6)

pred3 = predict(pfit, newdata = test_cart, type = "prob")[,2]
cart_loss <- logLoss(as.integer(test_cart$target)-1, pred3)
cart_loss
##   logLoss 
## 0.4975313
logLoss_results <- bind_rows(logLoss_results,
                          data_frame(Method="Decision Tree",  
                                     logLoss = cart_loss))

rm(train_cart, test_cart)
We repeat the classification using a random forest. A random forest grows a large number of bootstrapped trees and classifies an observation by combining the votes of the trees in this new “forest”. Surprisingly, the log loss of 0.5000 loses out to the logistic regression when using the complete-case features, and additional features did not improve it. This is before performing any parameter tuning.

library("randomForest")
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
feature.names <- names(select(train_fac,-c(ID,target,v22,v56,v125,v107,v110)))
feature.names <- c(feature.names,"v10","v12","v14","v21","v34","v40","v50","v114")
train_rf <- train_set[c("target",feature.names)]
test_rf <- test_set[c("target",feature.names)]

#Convert categorical features into factor
feature.names <- names(select(train_fac,-c(ID, v22,v56, v125,v107,v110)))
train_rf[feature.names] <- lapply(train_rf[feature.names], as.factor)
test_rf[feature.names] <- lapply(test_rf[feature.names], as.factor)

#Fit the model
rf_fit = randomForest(target ~ ., data = train_rf, ntree=200, nodesize=250, mtry=9, importance=TRUE)
plot(rf_fit)

varImpPlot(rf_fit, sort=T, main="Variable Importance")

var.Imp <- data.frame(importance(rf_fit, type=2))
var.Imp$Variables <- row.names(var.Imp)
var.Imp[order(var.Imp$MeanDecreaseGini, decreasing=T),]
##      MeanDecreaseGini Variables
## v50       2442.070134       v50
## v66        722.825683       v66
## v113       658.208236      v113
## v79        515.287728       v79
## v112       423.745541      v112
## v31        275.986120       v31
## v47        256.776633       v47
## v52        185.522127       v52
## v10        184.222232       v10
## v12        174.060250       v12
## v24        162.404674       v24
## v114       150.528028      v114
## v34        126.879858       v34
## v40        123.988029       v40
## v21        109.314066       v21
## v14         92.612291       v14
## v30         85.735695       v30
## v91         34.612551       v91
## v71         10.759146       v71
## v74          5.769893       v74
## v75          4.359533       v75
## v3           3.967014        v3
pred4 = predict(rf_fit, newdata = test_rf, type = "prob")[,2]
rf_loss <- logLoss(as.integer(test_rf$target)-1, pred4)
rf_loss
##   logLoss 
## 0.5000409
logLoss_results <- bind_rows(logLoss_results,
                          data_frame(Method="Random Forest",  
                                     logLoss = rf_loss))


logLoss_results
## Source: local data frame [5 x 2]
## 
##          Method   logLoss
##           (chr)     (dbl)
## 1       Average 0.5444054
## 2 Logit Model 1 0.4872247
## 3 Logit Model 2 0.4868249
## 4 Decision Tree 0.4975313
## 5 Random Forest 0.5000409

Summary

The goal of the project was to apply machine learning models to predict the category of claims. We tried three model families - logistic regression, decision trees and random forests - using various combinations of features. The best log loss score we achieved was around 0.49. Dexter’s Lab, who won the Kaggle competition, achieved a private leaderboard score of 0.42. We were a long way off!

Discussion

The anonymized dataset made the project much more challenging and difficult. Not knowing what the features represent, together with the large number of features with a high proportion of missing values, increased the complexity. We do not believe that the missing values are random; rather, they are due to the types of claims (product types) and the information collected. Apart from some imputation for features with a low proportion of missing values, we simply recoded the missing values to a new category/value.

Dexter’s Lab’s winning solution used an XGBoost model. XGBoost (extreme gradient boosting) is a tree-based gradient boosting method widely used in Kaggle competitions. Unfortunately we did not get to try it in this project. Another key to winning the competition was feature engineering: by exploring patterns in the data, Dexter’s Lab figured out which features represented dates and constructed a panel time-series dataset.
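For completeness, a minimal sketch of what an xgboost fit might have looked like on our 80/20 split, assuming the xgboost package, a purely numeric feature matrix, and a chosen feature set in `feature.names` (the parameter values here are illustrative, not tuned):

```r
library(xgboost)

# xgboost requires numeric matrices
dtrain <- xgb.DMatrix(data  = as.matrix(train_set[feature.names]),
                      label = train_set$target)
dtest  <- xgb.DMatrix(data  = as.matrix(test_set[feature.names]),
                      label = test_set$target)

# Binary classification, optimised directly for log loss
params <- list(objective = "binary:logistic", eval_metric = "logloss",
               eta = 0.1, max_depth = 6)
xgb_fit <- xgb.train(params, dtrain, nrounds = 200,
                     watchlist = list(test = dtest),
                     early_stopping_rounds = 20, verbose = 0)

pred_xgb <- predict(xgb_fit, dtest)
logLoss(test_set$target, pred_xgb)
```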

We spent a major part of the project visualising the data and worrying about missing values. In hindsight, a better approach would have been to fit a simple random forest model early on and use its variable importance to concentrate our visualisation on the important features.
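A sketch of that alternative workflow, assuming the no-missing feature set used earlier and a subsample to keep it fast (the subsample size and tree count are arbitrary choices for a rough screen):

```r
# Quick-and-dirty importance screen to prioritise EDA
screen <- train_set[c("target", feature.names)] %>%
    mutate(target = factor(target)) %>%
    sample_n(10000)
rf_screen <- randomForest(target ~ ., data = screen,
                          ntree = 100, importance = TRUE)
varImpPlot(rf_screen, n.var = 20, main = "Top 20 features to explore first")
```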